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Abstract 

We  describe  an  efficient  implementation  of  an  algorithm  for  computing  selected 
elements  of  a  general  sparse  symmetric  matrix  A  that  can  be  decomposed  as  A  =  LDLT , 
where  L  is  lower  triangular  and  D  is  diagonal.  Our  implementation,  which  is  called 
Sellnv,  is  built  on  top  of  an  efficient  supernodal  left-looking  LDLT  factorization  of  A. 
We  discuss  how  computational  efficiency  can  be  gained  by  making  use  of  a  relative  index 
array  to  handle  indirect  addressing.  We  report  the  performance  of  Sellnv  on  a  collection 
of  sparse  matrices  of  various  sizes  and  nonzero  structures.  We  also  demonstrate  how 
Sellnv  can  be  used  in  electronic  structure  calcuations. 


1  Introduction 

In  some  scientific  applications,  we  need  to  calculate  a  subset  of  the  entries  of  the  inverse 
of  a  given  matrix.  A  particularly  important  example  is  in  the  electronic  structure  analysis 
of  materials  using  algorithms  based  on  pole  expansion  [18,  20]  where  the  diagonal  and 
sometimes  sub-diagonals  of  the  discrete  Green’s  function  or  resolvent  matrices  are  needed 
in  order  to  compute  the  electron  density.  Other  examples  in  which  particular  entries  of  the 
Green’s  functions  are  needed  can  also  be  found  in  the  perturbation  analysis  of  impurities 
by  solving  Dyson’s  equation  in  solid  state  physics  [10],  or  the  calculation  of  retarded  and 
less-than  Green’s  function  in  electronic  transport  [6] .  We  will  call  this  type  of  calculations 
a  selected  inversion  of  a  matrix. 
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From  a  computational  viewpoint,  it  is  natural  to  ask  whether  one  can  develop  algorithms 
for  selected  inversion  that  is  faster  than  inverting  the  whole  matrix.  This  is  possible  at  least 
in  some  cases,  for  example,  when  A  is  obtained  from  a  finite  difference  discretization  of 
a  Laplacian  operator  or  from  lattice  models  in  statistical  or  quantum  mechanics  with  a 
local  Hamiltonian.  For  such  matrices,  a  fast  sequential  algorithm  has  been  proposed  to 
extract  the  diagonal  or  sub-diagonal  elements  of  their  inverse  matrices  [19] .  The  complexity 
of  the  fast  algorithm  is  0(n A5)  for  two  dimensional  (2D)  problems  and  0(n2)  for  three 
dimensional  (3D)  problems,  with  n  being  the  dimension  of  A.  This  is  lower  than  the  0(n3) 
complexity  associated  with  a  direct  inversion  of  the  full  matrix.  The  algorithm  presented  in 
[19]  contains  two  steps.  The  first  step  produces  an  LDLT  factorization  of  the  input  matrix 
A.  The  second  step  uses  L  and  D  matrices  to  compute  the  selected  components  of  A _1.  In 
the  following,  we  will  simply  refer  to  the  first  step  as  factorization,  and  the  second  step  as 
selected  inversion.  The  parallelization  of  this  algorithm  on  a  distributed  memory  machine  is 
described  in  the  recent  work  [21].  We  have  used  the  parallel  algorithm  to  perform  a  selected 
inversion  a  2D  Laplacian  of  dimension  4.3  billion  on  4,096  processors. 

The  design  and  implementation  of  the  fast  algorithm  proposed  in  [19]  and  [21]  depend 
explicitly  on  the  domain  shape  and  discretization  stencil  for  the  Laplacian  operator.  This 
leads  to  an  efficient  implementation,  but  restricts  the  application  of  the  algorithm.  On 
the  other  hand,  LDLT  factorization  is  a  general  concept.  Therefore,  it  is  natural  to  ask 
whether  it  is  possible  for  the  selected  inversion  algorithm  to  be  generalized  to  any  nonsin¬ 
gular  symmetric  matrix.  This  question  was  investigated  in  [11],  and  discussed  recently  in 
[17,  27,  21].  However,  no  efficient  software  package  is  currently  available  for  computing  a 
selected  inversion  of  a  general  sparse  symmetric  matrix  that  admits  an  LDLT  factorization. 
The  present  paper  intends  to  fill  such  a  gap  by  describing  an  efficient  algorithm  and  its 
implementation  for  such  a  task.  The  algorithm  and  its  implementation  described  here  will 
be  called  Sellnv. 

Our  paper  is  organized  as  follows.  We  begin  with  the  description  of  some  basic  concepts 
underlying  a  selected  inversion  algorithm  in  section  2,  and  discuss  why  the  complexity  of 
the  algorithm  can  be  made  lower  than  0(n3)  when  A  is  sparse.  We  discuss  the  use  of 
supernodes  and  block  algorithms  in  section  3,  which  is  key  to  achieving  high  performance. 
The  implementation  details  of  Sellnv  are  provided  in  section  4.  In  particular,  we  show 
how  a  relative  index  array  similar  to  that  used  in  a  sparse  LDLT  factorization  is  set  up  to 
handle  indirect  addressing  efficiently.  In  section  5,  we  report  the  performance  of  Sellnv  on 
a  collection  of  sparse  matrices.  We  also  demonstrate  how  Sellnv  can  be  used  in  electronic 
structure  calculations  in  section  6. 

Standard  linear  algebra  notation  is  used  for  vectors  and  matrices  throughout  the  paper. 
We  use  Ahj  to  denote  the  th  element  of  A.  Block  indices  are  denoted  by  uppercase 
script  letters  T,  J  etc..  Occasionally,  we  use  a  MATLAB  [25]  script  to  describe  a  simple 
algorithm.  In  particular,  we  use  the  MATLAB-style  notation  A(i:j,k:l)  to  denote  a 
submatrix  of  A  that  consists  of  rows  i  through  j  and  columns  k  through  l.  Another  notion 
we  use  to  denote  such  a  submatrix  is  A^j^.j.  Furthermore,  we  use  and  A*j  to  denote 
the  7-th  row  and  the  ji-th  column  of  A  respectively.  Similarly,  Ax*  and  A*  j-  are  used  to 
denote  the  X-th  block  row  and  the  77-th  block  column  of  A  respectively. 
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2  Selected  Inversion:  Basic  Idea 


An  obvious  way  to  obtain  selected  components  of  A~1  is  to  compute  A^1  first  and  then 
simply  pull  out  the  needed  entries.  The  standard  approach  for  computing  A _1  is  to  first 
decompose  A  as 

A  =  LDLt ,  (1) 

where  L  is  a  unit  lower  triangular  matrix  and  D  is  a  diagonal  matrix.  Equation  (1)  is 
often  known  as  the  LDLT  factorization  of  A.  Given  such  a  factorization,  one  can  obtain 
A~l  =  (xi,  X2,  •  •  • ,  x'n)  by  solving  a  number  of  triangular  systems 

Ly  =  Cj,  Dw  =  y,  LTXj  =  w,  (2) 


for  j  =  1,2, ...  ,n,  where  ej  is  the  j th  column  of  the  identity  matrix  I. 

An  alternative  algorithm  presented  in  [21]  and  summarized  below  also  performs  an 
LDLt  factorization  of  A  first.  However,  the  algorithm  does  not  require  solving  (2)  directly. 
Before  we  present  this  algorithm,  it  will  be  helpful  to  first  review  the  major  operations 
involved  in  the  LDLT  factorization  of  A. 

Let 


A  = 


a  aT  \ 
a  A  )  ' 


(3) 


be  a  nonsingular  symmetric  matrix.  The  first  step  of  an  LDLT  factorization  produces  a 
decomposition  of  A  that  can  be  expressed  by 


A  = 


1 

£  I 


a 


A  —  aaT /a 


1  F 

I 


where  a  is  often  referred  to  as  a  pivot,  £  =  a/ a  and  S  =  A  —  aaT /a  is  known  as  the  Schur 
complement.  The  same  type  of  decomposition  can  be  applied  recursively  to  the  Schur 
complement  S  until  its  dimension  becomes  1.  The  product  of  lower  triangular  matrices 
produced  from  the  recursive  procedure,  which  all  have  the  form 


I 

1 

£^  I 

where  £^  =  £  =  a/a ,  yields  the  final  L  factor.  The  (1, 1)  entry  of  each  Schur  complement 
together  with  a  become  the  diagonal  entries  of  the  D  matrix. 

To  simplify  our  discussion,  we  assume  here  that  all  pivots  produced  in  the  LDLT  fac¬ 
torization  are  sufficiently  large  so  that  no  row  or  column  permutation  (pivoting)  is  needed 
during  the  factorization. 

The  key  observation  made  in  [19]  and  [21]  is  that  A-1  can  be  expressed  by 

a-1+£TS~1£  -£tS 
-S~l£  S~l 


This  expression  suggests  that  once  a  and  £  are  known,  the  task  of  computing  A  1  can  be 
reduced  to  that  of  computing  S -1. 
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Because  a  sequence  of  Schur  complements  are  produced  recursively  in  the  LDLT  factor¬ 
ization  of  A.  the  computation  of  A _1  can  be  organized  in  a  recursive  fashion  also.  Clearly, 
the  reciprocal  of  the  last  entry  of  D  is  the  (n,  n)-th  entry  of  A _1.  Starting  from  this  entry, 
which  is  also  the  lxl  Schur  complement  produced  in  the  ( n  —  l)-th  step  of  the  LDLT  fac¬ 
torization  procedure,  we  can  construct  the  inverse  of  the  2x2  Schur  complement  produced 
at  the  (n  — 2)-th  step  of  the  factorization  procedure,  using  the  recipe  given  by  (4).  This  2x2 
matrix  is  the  trailing  2x2  block  of  A-1.  As  we  proceed  from  the  lower  right  corner  of  L 
and  D  towards  their  upper  left  corner,  more  and  more  elements  of  A -1  are  recovered.  The 
complete  procedure  can  be  easily  described  by  a  MATLAB  script  shown  in  Algorithm  1. 


Algorithm  1  A  MATLAB  script  for  computing  the  inverse  of  a  dense  matrix  A  given  its 

LDLt  factorization. _ 

Ainv(n,n)  =  1/D(n,n); 
for  j  =  n-1 : -1 : 1 

Ainv(j+1 :n, j )  =  -Ainv(j+1 :n, j+1 :n)*L(j+l :n, j) ; 

Ainv(j , j+1 :n)  =  Ainv(j+1 :n, j) 1 ; 

Ainv(j,j)  =  1/D ( j , j )  -  L(j+l:n,j) ’ *Ainv( j+1 :n, j ) ; 
end; 


For  clarity  purpose,  we  use  a  separate  array  Ainv  in  Algorithm  1  to  store  the  computed 
A-1.  In  practice,  A-1  can  be  computed  in  place.  That  is,  we  can  overwrite  the  array  used 
to  store  L  and  D  with  the  lower  triangular  and  diagonal  part  of  A-1  incrementally. 

It  is  not  difficult  to  observe  that  if  A  is  a  dense  matrix,  the  complexity  of  Algorithm  1  is 
0(n3)  because  a  matrix  vector  multiplication  involving  a  j  x  j  dense  matrix  is  performed  at 
the  jth  iteration  of  this  procedure,  and  (n  —  1)  iterations  are  required  to  fully  recover  A-1. 
Therefore,  when  A  is  dense,  this  procedure  does  not  offer  any  advantage  over  the  standard 
way  of  computing  A-1.  Furthermore,  all  elements  of  A-1  are  needed  and  computed.  No 
computation  cost  can  be  saved  if  we  just  want  to  extract  selected  elements  (e.g.,  the  diagonal 
elements)  of  A-1. 

However,  when  A  is  sparse,  a  tremendous  amount  of  saving  can  be  achieved  if  we  are 
only  interested  in  the  diagonal  components  of  A-1.  If  the  vector  i  in  (4)  is  sparse,  computing 
(T S~l£  does  not  require  all  elements  of  S'-1  to  be  obtained  in  advance.  Only  those  elements 
that  appear  in  the  rows  and  columns  corresponding  to  the  nonzero  rows  of  t  are  required. 

Therefore,  to  compute  the  diagonal  elements  of  A-1,  we  can  simply  modify  the  procedure 
shown  in  Algorithm  1  so  that  at  each  iteration  we  only  compute  selected  elements  of  A-1 
that  will  be  needed  by  subsequent  iterations  of  this  procedure.  It  turns  out  that  the  elements 
that  need  to  be  computed  are  completely  determined  by  the  nonzero  structure  of  the  lower 
triangular  factor  L.  To  be  more  specific,  at  the  jth  step  of  the  selected  inversion  process, 
we  compute  A~j  for  all  i  such  that  ^  0.  Therefore,  our  algorithm  for  computing  the 
diagonal  of  A-1  can  be  easily  described  by  a  MATLAB  script  (which  is  not  the  most  efficient 
implementation)  shown  in  Algorithm  2. 

To  see  why  this  type  of  selected  inversion  is  sufficient,  we  only  need  to  examine  the 
nonzero  structure  of  the  kth  column  of  L  for  all  k  <  j  since  such  a  nonzero  structure 
tells  us  which  rows  and  columns  of  the  trailing  sub-block  of  A^1  are  needed  to  complete 
the  calculation  of  the  (k,k)  entry  of  A-1.  In  particular,  we  would  like  to  find  out  which 
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Algorithm  2  A  MATLAB  script  for  computing  selected  matrix  elements  of  A-1  for  a 
sparse  symmetric  matrix  A. 

for  j  =  n-1 : -1 : 1 

7«  find  the  row  indices  of  the  nonzero  elements  in 

7«  the  j-th  column  of  L 

inz  =  j  +  f ind(L( j+1 :n, j ) ~=0) ; 

Ainv(inz,j)  =  -Ainv(inz,inz)*L(inz, j) ; 

Ainv(j,inz)  =  Ainv(inz, j) ’ ; 

Ainv(j , j)  =  l/d(j)  -  Ainv(j , inz)*L(inz, j) ; 
end; 


elements  in  the  jth  column  of  A  1  are  required  for  computing  A  - 1  for  any  k  <  j  and  i  >  j. 

Clearly,  when  Lj ^  =  0,  the  jth  column  of  A-1  is  not  needed  for  computing  the  /cth 
column  of  A-1.  Therefore,  we  only  need  to  examine  columns  k  of  L  such  that  Lj  ^  7^  0. 
A  perhaps  not  so  obvious  but  critical  observation  is  that  for  these  columns,  /  0  and 
Ljk  0  implies  Lij  7^  0  for  all  i  >  j.  Hence  computing  the  /cth  column  of  A-1  will  not 
require  more  matrix  elements  from  the  jth  column  of  A-1  than  those  that  have  already 
been  computed  (in  previous  iterations,)  i.e.  elements  A^j  such  that  Lij  /  0  for  i  >  j. 

These  observations  are  well  known  in  the  sparse  matrix  factorization  literature  [9,  12]. 
They  can  be  made  more  precise  by  using  the  notion  of  elimination  tree  [23].  In  such  a  tree, 
each  node  or  vertex  of  the  tree  corresponds  to  a  column  (or  row)  of  A.  Assuming  A  can  be 
factored  as  A  =  LDLT,  a  node  p  is  the  parent  of  a  node  j  if  and  only  if 


p  =  min{i  >  j\Ltj  /  0}. 


Figure  1:  The  lower  triangular  factor  L  of  a  sparse  10  x  10  matrix  A  and  the  corresponding 
elimination  tree. 

If  Ljfi  0  and  k  <  j,  then  the  node  k  is  a  descendant  of  j  in  the  elimination  tree.  An 
example  of  the  elimination  tree  of  a  matrix  A  and  its  L  factor  are  shown  in  Figure  1.  Such 
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a  tree  can  be  used  to  clearly  describe  the  dependency  among  different  columns  in  a  sparse 
LDLt  factorization  of  A.  In  particular,  it  is  not  too  difficult  to  show  that  constructing  the 
jth  column  of  L  requires  contributions  from  descendants  of  j  that  have  a  nonzero  matrix 
element  in  the  jth  row. 

Similarly,  we  may  also  use  the  elimination  tree  to  describe  which  selected  elements  within 
the  trailing  sub-block  A-1  are  required  in  order  to  obtain  the  (j,  j)-th  element  of  A _1.  In 
particular,  it  is  not  difficult  to  show  that  the  selected  elements  must  belong  to  the  rows  and 
columns  of  A -1  that  are  among  the  ancestors  of  j. 


3  Block  Algorithms  and  Supernodes 


The  selected  inversion  procedure  described  in  Algorithm  1  and  its  sparse  version  can  be 
modified  to  allow  a  block  of  rows  and  columns  to  be  modified  simultaneously.  A  block  algo¬ 
rithm  can  be  described  in  terms  a  block  factorization  of  A.  For  example,  if  A  is  partitioned 
as 


A  = 


An  Bl  \ 

£>21  A22  J 


its  block  LDLt  factorization  has  the  form 


A  = 


I 

L2i  I 


An 

A22  —  B2iA^  £?2i 


'f). 


(5) 


where  L21  =  B2iA111  and  S  =  A-22  —  B2\AXX  £>J]  is  the  Schur  complement.  The  correspond¬ 
ing  block  version  of  (4)  can  be  expressed  by 


4-1  _  (  A^  +  ^S-'Ln  -I&S-1} 

{  -s~1l21  s~l  J  ■ 


There  are  at  least  two  advantages  of  using  a  block  algorithm: 

1.  It  allows  us  to  use  level  3  BLAS  (Basic  Linear  Algebra  Subroutine)  to  develop  an 
efficient  implementation  by  exploiting  memory  hierarchy  in  modern  microprocessors; 

2.  When  applied  to  sparse  matrices,  it  tends  to  reduce  the  amount  of  indirect  addressing 
overhead. 


When  A  is  sparse,  the  columns  of  A  and  L  can  be  partitioned  into  supernodes.  A 
supernode  is  a  maximal  set  of  contiguous  columns  {j,  j  +  1, ...,  j  +  s}  of  L  that  have  the 
same  nonzero  structure  below  the  (j  +  s)-th  row,  and  the  lower  triangular  part  of  Lrj^sj:j^s 
is  completely  dense.  An  example  of  a  supernode  partition  of  the  lower  triangular  factor 
L  associated  with  a  49  x  49  sparse  matrix  A  is  shown  in  Figure  2.  The  definition  of  a 
supernode  can  be  relaxed  to  include  columns  whose  nonzero  structures  are  nearly  identical 
with  adjacent  columns.  However,  we  will  not  be  concerned  with  such  an  extension  in  this 
paper.  We  will  use  upper  case  script  letters  such  as  J  to  denote  a  supernode.  Following 
the  convention  introduced  in  [26],  we  will  interpret  J  either  as  a  supernode  index  or  a  set 
of  column  indices  contained  in  that  supernode  depending  on  the  context. 
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Figure  2:  A  supernode  partition  of  L. 


We  denote  the  set  of  row  indices  associated  with  the  nonzeros  rows  below  the  diagonal 
block  of  the  J th  supernode  by  Sj.  These  row  indices  are  further  partitioned  into  nj 
disjoint  subsets  Zi,Z2,  •••,21 riJ  such  that  Zj  contains  a  maximal  set  of  contiguous  row  indices 
and  Zj  C  /C  for  some  supernode  K,  >  J .  Here  K,  >  J  means  k  >  j  for  all  k  G  /C  and  j  G  J . 
In  Figure  3,  we  show  how  the  nonzero  rows  associated  with  one  of  the  supernodes  (the  26-th 
supernode  which  begins  at  column  27)  are  partitioned.  The  purpose  of  the  partition  is  to 
create  dense  submatrices  of  L  that  can  be  easily  accessed  and  manipulated.  The  reason  we 
impose  the  constraint  Z*  C  /C,  which  is  normally  not  required  in  the  LDLr  factorization  of 
A,  will  become  clear  in  section  4.  We  should  also  note  that,  under  this  partitioning  scheme, 
it  is  possible  that  Z,  and  Z?  belong  to  the  same  supernode  even  if  i  /  j. 

The  use  of  supernodes  leads  to  a  necessary  but  straightforward  modification  of  the 
elimination  tree.  All  nodes  associated  with  columns  within  the  same  supernode  are  collapsed 
into  a  single  node.  The  modified  elimination  tree  describes  the  dependency  among  different 
supernodes  in  a  supernode  LDLT  factorization  of  A  (see  [26,  28]).  Such  dependency  also 
defines  the  order  by  which  selected  blocks  of  A-1  are  computed. 

Using  the  notion  of  supernodes,  we  can  modify  the  selected  inversion  process  described 
by  the  MATLAB  script  shown  in  Algorithm  2  to  make  it  more  efficient.  If  columns  of  L  can 
be  partitioned  into  nsup  supernodes,  a  supernode  based  block  selected  inversion  algorithm 
can  be  described  by  the  pseudocode  shown  in  Algorithm  3. 

4  Implementation  Details 

We  now  describe  some  of  the  implementation  details  that  allow  the  selected  inversion  process 
described  schematically  in  Algorithm  3  to  be  carried  out  in  an  efficient  manner. 
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Algorithm  3  A  supernode-based  algorithm  for  computing  the  selected  elements  of  A  1 
Input:  (1)  The  supernode  partition  of  columns  of  A:  {1,2,  ...,nsup}; 

(2)  A  supernode  LDLT  factorization  of  A; 

Output:  Selected  elements  of  A-1,  i.e.  (A-1),^  such  that  Li:j  ^  0. 

1:  Compute  A~]up  ngup  = 

2:  for  {J  —  Y^sup  1?  Tisup  •••?  1  do 

3:  Identify  the  non-zero  rows  in  the  Jth.  supernode  Sj\ 

4:  Perform  Y  =  A~s^SjLSj,j\ 

5:  Calculate  .4 ^  ■  D ^  Y^  Lgj 

6:  Set  A jA  j  < - Y ; 

7:  end  for 


We  assume  a  supernode  LDLT  factorization  has  been  performed  using,  for  example, 
an  efficient  left-looking  algorithm  described  in  [26,  28].  Such  an  algorithm  typically  stores 
the  nonzero  elements  of  L  in  a  contiguous  array  using  the  compressed  column  format  [8]. 
This  array  will  be  overwritten  by  the  selected  elements  of  A-1.  The  row  indices  associated 
with  the  nonzero  rows  of  each  supernode  are  stored  in  a  separate  integer  array.  Several 
additional  integer  arrays  are  used  to  mark  the  supernode  partition  and  column  offsets. 

As  we  illustrated  in  Algorithm  2,  the  selected  inversion  process  proceeds  backwards  from 
the  last  supernode  nsup  towards  the  first  supernode.  For  all  supernodes  J  <  nsup,  we  need 
to  perform  a  matrix-matrix  multiplication  of  the  form 

Y  =  ( a~1)sj,SjLSj,j ,  (6) 

where  J  serves  the  dual  purposes  of  being  a  supernode  index  and  an  index  set  that  contains 
all  column  indices  belonging  to  the  J th  supernode,  and  Sj  denotes  the  set  of  row  indices 
associated  with  nonzero  rows  within  the  J  th  supernode  of  L. 

Recall  that  the  row  indices  contained  in  Sj  are  partitioned  into  a  number  of  disjoint 
subsets  X\ such  that  2}  C  K,  for  some  supernode  K,  >  J .  Such  a  partition 
corresponds  to  a  row  partition  of  the  dense  matrix  block  associated  with  the  JtYi  supernode 
into  n  j  submatrices.  The  same  partition  is  applied  to  the  rows  and  columns  of  the  submatrix 
(A_1)5j!5J  except  that  this  submatrix  is  not  stored  in  a  contiguous  array.  For  example, 
the  nonzero  row  indices  of  the  26-th  supernode  in  Figure  2,  which  consists  of  columns  27, 
28  and  29,  can  be  partitioned  as 

S2e  =  {30}  U  {40, 41}  U  {43, 44, 45}. 

This  partition  as  well  as  the  corresponding  partition  of  the  blocks  in  the  trailing  A-1  that 
are  used  in  (6)  is  highlighted  in  Figure  3. 

In  order  to  carry  out  the  matrix-matrix  multiplication  (6)  we  must  identify  the  location 
of  each  subblock  of  A^  (within  the  storage  allocated  for  L )  one  by  one  as  we  accumulate 
the  products  of  these  subblocks  and  the  corresponding  subblocks  of  L$j,j  in  a  work  array 
which  we  denote  by  Y.  Furthermore,  because  A-1  is  symmetric,  we  store  only  the  selected 
nonzero  elements  in  the  lower  triangular  part  of  the  matrix.  Hence,  our  implementation 
of  (6),  which  is  carefully  described  in  Algorithm  4,  makes  uses  of  the  transposes  of  the 
subblocks  in  the  lower  triangular  part  of  AjJ.  s  (line  10). 


27 


Figure  3:  The  partition  of  the  nonzero  rows  in  S26  and  the  matrix  elements  needed  in 
^30:49,30:49  for  the  Computation  of  A^g  30.49Z30:39, 27:29- 


To  identify  the  location  of  each  subblock  of  A~^  s  required  in  (6),  we  use  an  integer 
array  indmap  with  n  entries  to  record  the  relative  row  positions  of  the  first  row  of  Zj  in 
Y .  for  i  =  2,3,  To  be  specific,  the  indmap  array  is  initialized  to  have  0’s  in  all  its 

entries.  If  k  is  an  element  in  Z,,  (all  elements  in  Zj  are  sorted  in  an  ascending  order),  then 
indmap  [k]  is  set  to  be  the  relative  distance  of  row  k  from  the  last  row  of  in  the  diagonal 
block  of  the  J th  supernode  in  L.  For  example,  S26  contains  columns  27,  28, 29  and  is  shown 
as  the  leftmost  supernode  in  Figure  3.  The  nonzero  entries  of  the  indmap  array  for  S26  are 

indmap [30]  =  1 , 
indmap [40]  =  2 , 
indmap [41]  =  3, 
indmap [43]  =  4 , 
indmap [44]  =  5 , 
indmap [45]  =  6 . 

A  similar  indirect  addressing  scheme  is  used  in  [26]  for  the  gathering  and  scattering  opera¬ 
tions  used  in  the  LDLT  factorization  of  A. 

Once  the  indmap  array  is  properly  set  up,  the  subblock  searching  process  indicated  in 
line  7  of  the  pseudocode  shown  in  Algorithm  4  goes  through  the  row  indices  k  of  the  nonzero 
rows  of  the  1C  th  supernode  until  a  nonzero  indmap  [k]  is  found.  A  separate  pointer  p  to 
the  floating  point  array  allocated  for  L  is  incremented  at  the  same  time.  When  a  nonzero 
indmap  [k]  is  found,  the  position  in  the  floating  point  array  pointed  by  p  gives  the  location 
of  (A-1)^^.  required  in  line  9  of  the  special  matrix-matrix  multiplication  procedure  shown 
in  Algorithm  4.  Meanwhile,  the  value  of  indmap  [k]  gives  the  location  of  the  target  work 
array  Y  at  which  the  product  of  (A~1)jjtjj  and  Ljjtj  is  accumulated. 
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Algorithm  4  Sparse  matrix-matrix  multiplication. 

Input:  (1)  The  Jt h  supernode  of  L,  LsJtj,  where  Sj  contains  the  indices  of 

the  nonzero  rows  in  J .  The  index  set  Sj  is  partitioned  into  disjoint 
nj  subsets  of  contiguous  indices,  i.e.  Sj  =  {Zi,Z2, 

(2)  The  nonzero  elements  of  A~x  that  have  been  computed  previously. 
These  elements  are  stored  in  Lsk,k.  f°r  all  K.  >  J\ 

Output:  Y  =  {A-x)Sj,SjLSj,j\ 

1:  Construct  an  indmap  array  for  nonzero  rows  in  the  J th  supernode; 

2:  for  j  =  1,2,..., nj  do 

3:  Identify  the  supernode  K,  that  contains  Zj; 

4:  Let  IZi  =  indmap(Zj); 

5:  Calculate  Tr^*  <—  Ynlt*  +  {A~1)xj,xjLxj,j\ 

6:  for  *  =  j  +  1,  j  +  2,  ...nj  do 

7:  Use  indmap  to  find  the  first  nonzero  row  in  the  /Cth  supernode  that  belongs 

to  li  so  that  {A~1)xitxj  can  be  located; 

8:  Let  IZ2  =  indmap(Zj); 

9:  Calculate  Tr2,*  <—  Tr2,*  +  {A~l)xi,xjLxj,j\ 

10:  Calculate  <—  Tri,*  +  [(A~1)xitxj]T Lxitj', 

11:  end  for 

12:  end  for 

13:  Reset  the  nonzero  entries  of  indmap  to  zero; 


Before  we  copy  Y  to  the  appropriate  location  in  the  array  that  stores  the  ^7th  supernode 
of  L,  we  need  to  compute  the  diagonal  block  of  A within  this  supernode  by  the  following 
update: 

(^_1)z,77  =  +  YTLsj,j , 

where  (A-1)  j%j,  which  is  stored  in  the  diagonal  block  of  the  storage  allocated  for  L,  contains 
the  inverse  of  the  diagonal  block  Djtj  produced  by  the  supernode  LDLT  factorization 
before  the  update  is  performed. 

5  Performance 

In  this  section  we  report  the  performance  of  our  selected  inversion  algorithm  Sellnv.  Our 
performance  analysis  is  carried  out  on  the  Franklin  cluster  maintained  at  NERSC.  Each 
compute  node  consists  of  a  2.3  GHz  single  socket  quad-core  AMD  Opteron  processor  (Bu¬ 
dapest)  with  a  theoretical  peak  performance  of  9.2  GFlops/sec  per  core  (4  flops/cycle  if 
using  SSE128  instructions).  Each  core  has  2  GB  of  memory.  Our  test  problems  are  taken 
from  Harwell-Boeing  Test  Collection  [8]  and  the  University  of  Florida  Matrix  Collection  [7]. 
These  matrices  are  widely  used  benchmark  problems  for  sparse  direct  methods.  The  names 
of  these  matrices  as  well  as  some  of  their  characteristics  are  listed  in  Table  1  and  2.  All 
matrices  are  real  and  symmetric.  The  multiple  minimum  degree  (MMD)  matrix  reordering 
strategy  [22]  is  used  to  minimize  the  amount  of  nonzero  fills  in  L.  We  used  the  supernodal 
left-looking  algorithm  and  code  provided  by  the  authors  of  [26]  to  perform  the  LDLT  fac¬ 
torization  of  A.  Table  3  gives  the  performance  result  in  terms  of  computational  time  as  well 
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Problem 
bcsstkl4 
bcsstk24 
bcsstk28 
bcsstkl8 
bodyyG 
crystm03 
wathenl20 
thermal  1 
shipsecl 
pwtk 

parabolic_fem 
trnt  syrri 
ecology2 
G3  .circuit 


Description 

Roof  of  the  Omni  Coliseum,  Atlanta. 

Calgary  Olympic  Saddledome  arena. 

Solid  element  model,  linear  statics. 

R.E.  Ginna  Nuclear  Power  Station. 

NASA,  Alex  Pothen. 

FEM  crystal  free  vibration  mass  matrix. 

Gould, Higharn, Scott:  matrix  from  Andy  Wathen,  Oxford  Univ. 
Unstructured  FEM,  steady  state  thermal  problem,  Dani  Schmid,  Univ.  Oslo. 
DNV-Ex  4  :  Ship  section/detail  from  production  run-1999-01-17. 
Pressurized  wind  tunnel,  stiffness  matrix. 
Diffusion-convection  reaction,  constant  homogeneous  diffusion. 
Symmetric  electromagnetic  problem,  David  Isaak,  ComputationaLEMJWorks. 
Circuitscape:  circuit  theory  applied  to  animal/gene  flow,  B.  McRae,  UCSB. 
Circuit  simulation  problem,  Ufuk  Okuyucu,  AMD,  Inc. 


Table  1:  Test  problems 


as  floating  point  operations  per  second  (flops)  for  both  the  factorization  and  the  selected 
inversion  algorithms  respectively.  We  also  report  the  average  flops. 

The  dimension  of  the  matrices  we  tested  ranges  from  problem  size  ranges  from  2, 000 
to  1.5  million,  and  the  number  of  non-zero  elements  in  the  L  factor  ranges  from  0.1  million 
to  0.2  billion.  For  the  largest  problem  G3_circuit,  the  overall  computation  takes  only  350s. 
Among  these  problems,  the  best  performance  is  obtained  with  the  problem  pwtk.  For 
this  particular  problem,  the  factorization  part  attains  26%  of  the  peak  performance  of  the 
machine,  and  the  selected  inversion  part  attains  68%  of  the  peak  flops.  The  average  flops 
ratio  is  46%. 

To  demonstrate  how  much  we  can  gain  by  using  the  selected  inversion  algorithm  instead 
of  the  naive  approach  of  inverting  A  directly  through  (2),  which  we  will  refer  to  as  the 
direct  inversion,  we  list  the  timing  statistics  for  both  approaches  in  Table  4  as  well  as  the 
speedup  factor.  The  speedup  factor  is  defined  by  the  time  for  selected  inversion  divided  by 
the  time  for  direct  inversion.  In  this  comparison  selected  inversion  just  refers  to  the  second 
part  of  Sellnv.  The  time  for  factorization  part  is  not  counted,  since  factorization  is  shared 
between  both  algorithms.  Even  for  the  smallest  problem  bcsstkl4,  the  speedup  factor  is 
108.  For  the  largest  problem,  selected  inversion  uses  219s,  while  the  direct  inversion  requires 
an  estimated  21.8  days.  Among  these  problems,  the  largest  speedup  gain  is  achieved  for 
problem  ecology2.  In  this  case  the  selected  inversion  is  18,  000  times  faster  than  the  direct 
full  inversion. 
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problem 

n 

\L\ 

bcsstkl4 

1,806 

32,630 

112,267 

bcsstk24 

3,562 

81,736 

278,922 

bcsstk28 

4,410 

111,717 

346,864 

bcsstkl8 

11,948 

80,519 

662,725 

bodyyG 

19,366 

77,057 

670,812 

crystm03 

24,696 

304,233 

3,762,633 

wathenl20 

36,441 

301,101 

2,624,133 

thermall 

82,654 

328,556 

2,690,654 

shipsecl 

140,874 

3,977,139 

40,019,943 

pwtk 

217,918 

5,926,171 

56,409,307 

parabolic  Jem 

525,825 

2,100,225 

34,923,113 

tmt  sym 

726,713 

2,903,837 

41,296,329 

ecology2 

999,999 

2,997,995 

38,516,672 

G  3  .circuit 

1,585,478 

4,623,152 

197,137,253 

Table  2:  Characteristic  of  the  test  problems 


problem 

factorization 

time(sec) 

factorization 

flops(G/sec) 

selected  inversion 
time(sec) 

selected  inversion 
flops(G/sec) 

average 
flops  (G/sec) 

bcsstkl4 

0.007 

1.43 

0.010 

2.12 

1.85 

bcsstk24 

0.019 

1.75 

0.020 

3.65 

2.71 

bcsstk28 

0.023 

1.63 

0.024 

3.46 

2.54 

bcsstkl8 

0.080 

1.80 

0.235 

1.54 

1.60 

bodyy6 

0.044 

1.49 

0.090 

1.68 

1.61 

crystm03 

0.452 

2.26 

0.779 

2.95 

2.70 

wathenl20 

0.251 

2.12 

0.344 

3.47 

2.90 

thermall 

0.205 

1.53 

0.443 

1.66 

1.62 

shipsecl 

18.48 

2.38 

17.66 

5.45 

3.88 

pwtk 

16.43 

2.48 

14.55 

6.28 

4.26 

parabolicJem 

6.649 

2.34 

20.06 

1.91 

2.02 

tmt_sym 

10.64 

2.35 

13.98 

4.02 

3.30 

ecology2 

6.789 

2.32 

16.04 

2.35 

2.34 

G3  .circuit 

136.5 

2.24 

218.7 

3.27 

2.88 

Table  3:  The  time  cost,  and  flops  result  for  factorization  and  selected  inversion  process 
respectively.  The  last  column  reports  the  average  flops  reached  by  Sellnv. 
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problem 

selected  inversion 

time 

direct  inversion 

time 

speedup 

bcsstkl4 

0.010  sec 

1.080  sec 

108 

bcsstk24 

0.020  sec 

5.360  sec 

268 

bcsstk28 

0.024  sec 

8.930  sec 

372 

bcsstkl8 

0.235  sec 

53.37  sec 

227 

bodyyG 

0.090  sec 

104.6  sec 

1162 

crystm03 

0.779  sec 

577.9  sec 

742 

wathenl20 

0.344  sec 

655.9  sec 

1906 

thermall 

0.443  sec 

0.539  hr 

4308 

shipsecl 

17.66  sec 

8.933  hr 

1820 

pwtk 

14.55  sec 

19.56  hr 

4839 

parabolicJem 

20.06  sec 

1.521  day 

6551 

tmt_sym 

13.98  sec 

2.486  day 

15364 

ecology2 

16.04  sec 

3.374  day 

18174 

G3_circuit 

218.7  sec 

21.78  day 

8604 

Table  4:  Time  cost  comparison  between  selected  inversion  and  direct  inversion.  The  speedup 
factor  is  defined  by  the  time  cost  of  direct  inversion  divided  by  the  time  cost  of  selected 
inversion. 

6  Application  to  electronic  structure  calculation  of  Aluminum 

In  this  section,  we  show  how  Sellnv  can  be  applied  to  electronic  structure  calculations  within 
the  density  functional  theory  (DFT)  framework  [15,  16].  The  most  time  consuming  part  of 
these  calculations  is  the  evaluation  of  the  electron  density 

p  =  diag  (fp,n(H)),  (7) 

where  and  =  1/(1  +  e^~^)  is  the  Fermi-Dirac  distribution  function  with  /3  being  a 

parameter  that  is  proportional  to  the  reciprocal  of  the  temperature  and  p  being  the  chemical 
potential.  The  symmetric  matrix  H  in  (7)  is  a  discretized  Kohn-Sham  Hamiltonian  [24] 
defined  as 

H  =  ~\  A  +  Vpse{r)  +  VH{r)  +  Vxc{r),  (8) 

where  A  is  the  Laplacian,  Vh  is  the  Hartree  potential,  Vxc  is  the  exchange-correlation 
potential  constructed  via  the  local  density  approximation  (LDA)  theory  [24]  and  FpSe  is  the 
real  space  Troullier-Martins  ionic  pseudopotential  [5]. 

The  standard  approach  for  evaluating  (7)  is  to  compute  the  invariant  subspace  associated 
with  a  few  smallest  eigenvalues  of  H.  This  approach  is  used  in,  for  example,  PARSEC  [1], 
which  is  a  real  space  electronic  structure  calculation  software  package. 

An  alternative  way  to  evaluate  (7)  is  to  use  a  recently  developed  pole  expansion  technique 
[18,  20]  to  approximate  The  pole  expansion  technique  expresses  electron  density  p  as 
a  linear  combination  of  the  diagonal  of  (H  —  (p  +  z,)/)-1,  i.e. 

(jag  g  _(/+„)/)■  <9) 
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Here  Im  ( H )  stands  for  the  imaginary  part  of  H.  The  parameters  z*  and  uii  are  the  complex 
shift  and  weight  associated  to  the  7-th  pole  respectively.  They  can  be  chosen  so  that  the  total 
number  of  poles  P  is  minimized  for  a  given  accuracy  requirement.  At  room  temperature,  the 
number  of  poles  required  in  (9)  is  relatively  small  (less  than  80).  In  addition  to  temperature, 
the  pole  expansion  (9)  also  requires  an  explicit  knowledge  of  the  chemical  potential  /i,  which 
must  be  chosen  so  that  the  condition 

trac  e(fp^(H))  =  ne  (10) 

is  satisfied.  This  can  be  accomplished  by  solving  (10)  using  the  standard  Newton’s  method. 

In  order  to  use  (9) ,  we  need  to  compute  the  diagonal  of  the  inverse  of  a  number  of  complex 
symmetric  (non-Hermitian)  matrices  H  —  (zi+fj,)I  ( i  =  1,  2, ...,  P).  A  fast  implementation  of 
the  Sellnv  algorithm  described  in  section  4  can  be  used  to  perform  this  calculation  efficiently, 
as  the  following  example  shows. 

The  example  we  consider  here  is  a  quasi-2D  aluminum  system  with  a  periodic  boundary 
condition.  For  simplicity,  we  only  use  a  local  pseudopotential  in  (8),  i.e.  Vrpse(r)  is  a  diagonal 
matrix.  The  Laplacian  operator  A  is  discretized  using  a  second-order  five-point  stencil.  A 
room  temperature  of  300K  (which  defines  the  value  of  (3 )  is  used.  The  aluminum  system 
has  a  face  centered  cubic  (FCC)  crystal  structure.  We  include  5  unit  cells  along  both  x  and 
y  directions,  and  1  unit  cell  along  the  z  direction  in  our  computational  domain.  Each  unit 
cell  is  cubic  with  a  lattice  constant  of  4.05A.  Therefore,  there  are  altogether  100  aluminum 
atoms  and  300  valence  electrons  in  the  experiment.  The  position  of  each  aluminum  atom 
is  perturbed  from  its  original  position  in  the  crystal  by  a  random  displacement  around 
1CT3A  so  that  no  point  group  symmetry  is  assumed  in  our  calculation.  The  grid  size  for 
discretization  is  set  to  0.2lA.  The  resulting  Hamiltonian  matrix  size  is  159, 048. 

We  compare  the  density  evaluation  (7)  performed  by  both  PARSEC  and  the  pole  ex¬ 
pansion  technique.  In  PARSEC,  the  invariant  subspace  associated  with  the  smallest  310 
eigenvalues  are  computed  using  ARPACK.  This  calculation  takes  2,490  seconds.  In  the 
pole  expansion  approach,  we  use  60  poles  in  (9),  which  gives  a  comparable  relative  error  in 
electron  density  on  the  order  of  10“5  (in  L\  norm.)  The  MMD  reordering  scheme  is  used  to 
reduce  the  amount  of  fill  in  the  LDL1  factorization.  In  addition  to  using  the  selected  inver¬ 
sion  algorithm  to  evaluate  each  term  in  (9),  an  extra  level  of  coarse  grained  parallelism  can 
be  utilized  by  assigning  each  pole  to  a  different  processor.  The  evaluation  of  each  term  in  (9) 
takes  roughly  1,950  seconds.  Therefore,  the  total  amount  of  time  required  to  evaluate  (7) 
on  a  single  core  is  1, 950  x  60  seconds.  As  a  result,  the  performance  of  the  selected  inversion 
based  pole  expansion  approach  is  only  comparable  to  the  invariant  subspace  computation 
approach  used  in  PARSEC  if  the  extra  level  of  coarse  grained  parallelism  is  used. 

A  3D  isosurface  plot  of  the  electron  density  as  well  as  the  electron  density  plot  restricted 
on  the  z  =  0  plane  are  shown  in  Figure  4. 

We  also  remark  that  the  efficiency  of  selected  inversion  can  be  further  improved.  One  of 
the  factors  that  have  prevented  the  Sellnv  from  achieving  even  higher  performance  is  that 
most  of  the  supernodes  produced  from  the  MMD  ordering  of  H  contains  only  1  column  even 
though  many  of  these  supernodes  have  similar  (but  not  identical)  nonzero  structures.  Con¬ 
sequently,  both  the  factorization  and  inversion  are  dominated  by  level-1  BLAS  operations. 
Further  performance  gain  is  likely  to  be  achieved  if  we  relax  the  definition  of  a  supernode 
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Figure  4:  (a)3D  isosurface  plot  of  the  electron  density  together  with  the  electron  density 
restricted  to  2  =  0  plane,  (b)  The  electron  density  restricted  to  z  =  0  plane. 


and  treat  some  of  the  zeros  in  L  as  non-zeros.  This  approach  has  been  demonstrated  to  be 
extremely  helpful  in  [3]. 

7  Concluding  Remarks 

We  presented  an  efficient  sequential  algorithm  for  computing  selected  components  of  the 
inverse  of  a  general  sparse  symmetric  matrix  A ,  and  described  its  implementation  Sellnv. 
Our  algorithm  consists  of  two  steps.  In  the  first  step,  we  perform  an  LDLT  factorization 
of  the  matrix  A  using  a  supernodal  left-looking  LDLT  factorization  algorithm  developed 
in  [26].  This  step  can  also  be  implemented  using  other  existing  software  packages  such  as 
[26,  2,  29,  4,  14,  13].  In  the  second  step,  a  selected  inversion  algorithm  specifically  designed 
for  the  supernodal  nonzero  structure  of  L  is  used  to  compute  the  nonzero  blocks  of  A~1 
that  have  a  corresponding  nonzero  block  in  L.  The  use  of  supernodes  enables  us  to  exploit 
the  memory  hierarchy  of  modern  microprocessors  to  achieve  high  performance. 

We  demonstrated  the  efficiency  of  our  implementation  of  the  selected  inversion  algorithm 
by  applying  our  code  Sellnv  to  a  variety  of  benchmark  problems  with  dimension  as  large  as 
1.5  million.  We  were  able  to  achieve  a  relatively  high  percentage  of  the  peak  performance 
on  the  high  performance  machine  we  we  used  to  conduct  our  experiments.  In  one  case,  we 
were  able  to  achieve  68%  of  the  peak  performance. 

We  also  demonstrated  how  Sellnv  can  be  applied  to  the  electronic  structure  calculation 
of  an  aluminum  system  using  a  pole  expansion  technique  [18,  20].  We  compared  the  effi¬ 
ciency  of  our  algorithm  with  the  standard  real  space  electronic  structure  calculation  software 
PARSEC.  Our  comparison  showed  that  the  performance  of  the  pole  expansion  approach  is 
comparable  to  that  of  PARSEC  if  a  coarse-grained  parallelization  of  the  poles  expansion  is 
used. 

To  solve  problems  with  more  degrees  of  freedom,  the  selected  inversion  algorithm  itself 
must  be  parallelized  on  distributed  memory  parallel  computers.  This  is  a  research  direction 
that  we  plan  to  pursue  in  the  near  future. 
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